
Deploy and Use Open Source GPT Models for RAG

You work as a network engineer at a renowned system integrator company. You are tasked with configuring a broad range of networking devices, from enterprise-level Cisco Catalyst switches to Cisco Nexus 9000 devices in data centers. Keeping track of configuration details across these platforms is a challenging task. Although you are adept at reading and writing technical documentation, it still takes you a considerable amount of time. Despite having a subscription to a cloud AI provider that could streamline your search process, company policy restricts you from uploading any confidential information to the provider. You therefore consider deploying a comparable AI solution on-premises yourself.

While searching for an appropriate on-premises solution, you come across various open-source GPT models and chatbot applications capable of managing general IT tasks. A standout discovery is the open-source Open WebUI application, which incorporates an Ollama inference server and offers a user-friendly chat interface. This interface is equipped with advanced features such as Retrieval Augmented Generation (RAG), allowing you to upload files and use them as reference data for the GPT. Remarkably, deploying this application is straightforward, requiring just a simple Docker command. You decide to try Open WebUI.

To proceed, you need a computer—either physical or virtual—with a GPU, which considerably enhances processing speeds. Fortunately, the IT Ops department has provided you with a Linux VM that is equipped with 8 GB of GPU RAM and Docker already configured to use the GPU resources.

To deploy the Open WebUI application, you run the following Docker command that you found in the official documentation:

docker run -d -p 3000:8080 -e WEBUI_AUTH=False --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

This Docker command starts the container in detached mode by using -d, which allows it to operate in the background without occupying the terminal. The command also maps port 3000 on your VM to port 8080 on the container via -p 3000:8080, enabling you to access the application through your VM IP address on port 3000. Authentication is disabled with -e WEBUI_AUTH=False for easier access during initial tests. The --gpus=all option allocates all available GPU resources to the Docker container to help ensure optimal performance.

Further, the command mounts volumes with -v ollama:/root/.ollama and -v open-webui:/app/backend/data for storing the downloaded models and application data, respectively. The container is named open-webui for straightforward management and is set to automatically restart with --restart always should it stop unexpectedly, such as during a reboot. Finally, ghcr.io/open-webui/open-webui:ollama specifies the Docker image, which is sourced from the GitHub Container Registry.

When you press the Enter key, you feel relieved that you have avoided a lengthy and tedious installation and configuration process. After a few seconds, the Open WebUI application is up and running on your VM, providing a robust, in-house solution for your documentation and configuration search needs while remaining compliant with your company's data security policies.

Get Started with Open WebUI and RAG

With the Open WebUI application up and running, you begin searching the web to understand how Retrieval Augmented Generation (RAG) functions. You learn that RAG relies on a technique called semantic search, which identifies relevant context within the files you upload to the RAG system. Unlike traditional search methods that search for exact keyword matches, semantic search aims to grasp the intent and contextual meaning behind the words in a query.

A key component that enables semantic search is the embedding process that transforms the text from your queries and potential information sources into numerical vectors. These vectors are essentially lists of numbers that represent text in a high-dimensional space. You can think of these vectors as coordinates on a map where texts with similar meanings have vectors that are closer together. For example, the words "apple" and "banana" will be placed close to each other because both are types of fruit, while the word "keyboard" will be further from the fruits, as it relates to a different context. The same principle applies to phrases or even entire paragraphs.
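This closeness can be made concrete with cosine similarity, the measure most vector databases use to compare embeddings. The following sketch uses made-up three-dimensional vectors; real embedding models such as mxbai-embed-large produce vectors with around a thousand dimensions, but the principle is the same:

```python
import math

# Toy 3-dimensional vectors standing in for real embeddings.
# The values are illustrative, not output from any actual model.
vectors = {
    "apple":    [0.90, 0.80, 0.10],
    "banana":   [0.85, 0.75, 0.15],
    "keyboard": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    # 1.0 means identical direction (same meaning); values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(vectors["apple"], vectors["banana"]))    # high: both fruits
print(cosine_similarity(vectors["apple"], vectors["keyboard"]))  # much lower
```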

The files you add to the RAG system undergo several processing steps. First, the text is extracted from the files. This text is then divided into smaller sections called chunks. Chunking is necessary because GPT models have a limit on the amount of text, measured in tokens, that they can process at once. This limit is known as the context size of the model and varies between different GPT models. The way text is chunked significantly impacts the quality of the answers, because each chunk should ideally capture one semantically coherent piece of content, but you will explore that in more detail later.
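A minimal sketch of one common chunking approach, fixed-size chunks with overlap, might look as follows. The chunk size and overlap values here are arbitrary illustrations, not Open WebUI's defaults:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Naive fixed-size chunking: each chunk shares `overlap` characters with
    # the previous one, so content split at a boundary still appears whole
    # in at least one chunk.
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# A stand-in "document" made of repeated switch configuration lines.
doc = "interface GigabitEthernet1/0/1\n switchport mode access\n" * 20
print(len(chunk_text(doc)))  # number of chunks produced for this document
```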

After chunking, the sections are transformed into vectors using an embedding model, which is a specialized Large Language Model (LLM). These vectors are then stored in a vector database, optimized for efficiently handling high-dimensional data. You can think of the vector database as a lookup table, where each vector serves as a key linked to its corresponding raw text. The entire process is illustrated in the next figure.

When you ask a question using the RAG system, it first uses the embedding LLM to convert your question into vectors. It then uses these vectors to search the vector database for the information that is most relevant to your query. The text from the best matches is combined with your original question to create a new, expanded context. This expanded context, along with your initial query, is sent to the inference LLM in raw text format. The inference LLM uses this information to provide an accurate answer, enhancing its built-in knowledge with the most relevant data extracted from the database. You see the entire process shown in the next figure. Note that the same embedding LLM is used both for creating the database from the uploaded files and for the semantic search over queries.
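The query flow described above can be sketched end to end. Here embed() is a toy stand-in for the embedding LLM and the vector database is just an in-memory list; a real deployment would call an actual embedding model and a dedicated vector store:

```python
import math

def embed(text):
    # Stand-in for the embedding LLM: a real system would call the model here.
    # This toy version counts occurrences of a few networking keywords.
    keywords = ["vlan", "switch", "route", "interface"]
    return [text.lower().count(k) for k in keywords]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# "Vector database": (vector, raw chunk) pairs built at upload time.
database = [(embed(chunk), chunk) for chunk in [
    "Use 'switchport access vlan 10' to assign a VLAN to an interface.",
    "Static routes are configured with the 'ip route' command.",
    "NTP keeps device clocks synchronized.",
]]

def answer(query, top_k=2):
    qvec = embed(query)                                        # 1. embed the query
    ranked = sorted(database, key=lambda e: cosine(qvec, e[0]), reverse=True)
    context = "\n".join(chunk for _, chunk in ranked[:top_k])  # 2. best matches
    # 3. the expanded context plus the original query goes to the inference LLM
    return f"Context:\n{context}\n\nQuestion: {query}"

print(answer("How do I assign a vlan to a switch interface?"))
```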

Satisfied with your high-level understanding of the RAG pipeline, you begin to explore the Open WebUI application.

Step 1

Type http://localhost:3000 in the URL field and press the Enter key.

Answer

You should see the Open WebUI landing page.

Step 2

Focus on the navigation panel on the left.

Answer

You can hide the navigation panel by clicking the hamburger icon in the top-right corner of the panel. The New Chat button opens a new chat interface on the right side of the application using the default model, which is currently the Arena Model. The Workspace button opens additional settings where you can set the inference GPT model, upload files for RAG, create prompt suggestions, and even include tools such as web search. The Search option lets you search by tags; currently there are no tags. Previous chats are located below the All chats drop-down list. Since you have no chat history, the drop-down menu is grayed out and no chats are visible.

Step 3

Focus on the main chat interface on the right.

Answer

Notice the suggested prompts below the main chat window in the middle of the panel. The default suggested prompts are generated every time you refresh or load the interface. Because this is a simulation, you may see the same suggestions. However, note that these suggestions might differ from the figures in the lab guide because the browser was refreshed during simulation development.

In the top-left corner of the chat interface you see the currently selected chat model, which is currently the Arena Model.

Do not confuse the chat model with the underlying GPT inference model. The Arena Model is just a placeholder for a fully configured chat interface. You must select an inference model for the chat interface to be active.

Step 4

Click the Arena Model text in the upper-left corner of the chat interface to open a drop-down menu listing all available inference models.

Answer

You should see the models listed in a name:tag format followed by the number of parameters that make up the model.

The phi3 and llama3.1 models are categorized as instruct models. Instruct models are specifically trained to excel at tasks that require following user instructions. These models are particularly well suited for chatbots because they can interpret a wide range of user queries and commands and respond to them effectively. The mxbai-embed-large model is trained specifically for embedding tasks. This model is used in RAG applications for semantic search.

The models featured in this selection are classified as small LLMs because they possess fewer parameters. Models with a greater number of parameters typically demonstrate enhanced capabilities, handling more complex tasks and offering more nuanced responses. Each model here is designed to operate on a consumer-grade GPU with 8 GB of RAM, making them accessible for smaller-scale applications.

To estimate the RAM consumption of a GPT model, you can use the formula M = P x (Q/8) x 1.2, where M is the estimated memory in gigabytes, P is the number of parameters in billions, Q is the quantization in bits, and 1.2 accounts for additional overhead related to inference optimization tasks. Quantization defines how many bits are used for storing each parameter in memory. It's important to note that while quantization helps reduce memory requirements by compressing model size, it can also lead to a decrease in answer quality.

For instance, the phi3 model, which has 3.8 billion parameters and uses 4-bit quantization, would have an estimated GPU RAM consumption calculated as follows: 3.8 x (4/8) x 1.2 = 2.28 GB of RAM. Similarly, the llama3.1 model, with 8 billion parameters and 4-bit quantization, would require approximately 4.8 GB of RAM. This approach ensures that the models are both efficient and effective within the constraints of typical consumer hardware. To put things into perspective, GPT-3 has 175 billion parameters and uses 16-bit quantization, which consumes around 420 GB of RAM.
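The estimates above can be reproduced directly from the formula. This is a sketch that assumes the flat 1.2 overhead factor from the text applies uniformly to every model:

```python
def estimate_ram_gb(params_billion, quant_bits, overhead=1.2):
    # M = P x (Q / 8) x 1.2 -- parameters in billions, quantization in bits.
    return params_billion * (quant_bits / 8) * overhead

print(round(estimate_ram_gb(3.8, 4), 2))   # phi3:     2.28 GB
print(round(estimate_ram_gb(8, 4), 2))     # llama3.1: 4.8 GB
print(round(estimate_ram_gb(175, 16)))     # GPT-3:    420 GB
```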

The RAM consumption for RAG applications can be calculated by considering both the inference and the embedding models. For example, the mxbai-embed-large embedding model with 334 million (0.334 billion) parameters and using 16-bit quantization, requires approximately 0.8 GB of RAM. When combined with the llama3.1 model, the entire RAG system has a baseline RAM requirement of at least 5.6 GB. Additionally, optimization processes such as data caching, activation map storage, and parallel processing overhead are actively running on the GPU, all of which demand extra RAM. These processes are essential for efficient data retrieval, quick access to frequently used data, and managing the computational load across multiple GPU cores. Therefore, allocating 8 GB of GPU RAM for the system provides sufficient headroom to accommodate these demands and ensures smooth operation under varying workloads.

Note

The models were downloaded at the time of writing this guide. The latest tag means that the models were at their latest version when the lab guide was created. These open-source models are updated quite frequently, so keep in mind that the latest versions might behave differently. You can download additional models from the Hugging Face or Ollama portals, or even use cloud services, such as ChatGPT from OpenAI, in the administration settings of the application. Since this is a simulation, you will only be able to choose the llama3.1 model.

Step 5

Choose llama3.1:latest from the menu.

Answer

Notice how the chat model name changes to the inference model name.

Step 6

Click in the chat prompt field with the How can I help you today? text.

Answer

You should see the chat interface change slightly.

In a real environment, this transition happens in real time when you click and start writing the prompt.

Step 7

Click the chat prompt field again and write Tell me a joke in the chat prompt.

Answer

In a real environment, you would first see only your text, and after a few seconds, you would see a completion suggestion from the system.

Note

Because this is a simulation, you cannot experiment using different prompts and are limited to the ones specified in the instructions. You must enter the prompts exactly as they are specified in the instructions, otherwise the simulation will not continue.

Step 8

Press the Enter key to continue with the simulation.

Answer

You should see an example of automated prompt completion.

Similar to how prompt suggestions are randomly generated with each page refresh, the prompt completions will differ even when you use the exact same starting prompt. This is due to the non-deterministic nature of the GPT models that are used to generate these completions. Non-deterministic means that the same input does not necessarily produce the same output.
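One way to picture this non-determinism: during generation, the model samples the next token from a probability distribution instead of always picking the single most likely token. The token names and scores below are invented purely for illustration:

```python
import math
import random

def sample_next_token(logits, temperature=0.8):
    # Convert raw scores to probabilities (softmax), then sample.
    # With sampling enabled, the same input can yield different outputs;
    # greedy decoding (always taking the argmax) would be deterministic.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    m = max(scaled.values())
    weights = [math.exp(s - m) for s in scaled.values()]
    return random.choices(list(scaled.keys()), weights=weights, k=1)[0]

# Invented next-token scores after a prompt like "Tell me a"
logits = {"joke": 2.0, "story": 1.5, "riddle": 1.0}
print({sample_next_token(logits) for _ in range(50)})  # usually several distinct tokens
```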

Step 9

Click the up-arrow icon or press the Enter key to ignore the suggestion and send the prompt to the GPT.

Answer

Since this is a simulation, you will get the answer instantly.

However, in a real environment you would normally see a loading screen first.

In the case of the very first query, it takes a while for the system to load the GPT model onto the GPU. Specifically, the llama3.1 model, which contains approximately 4.9 GB of data, needs to be transferred to the GPU memory. This process, known as cold-loading, can be time-consuming but is essential for preparing the model for efficient operation. Cold-loading occurs when the model is loaded from scratch, which takes a couple of seconds for smaller models like llama3.1. However, imagine loading the much larger GPT-3 model, which encompasses around 350 GB of data. Once loaded, any subsequent usage benefits from what is known as hot-loading, where the model remains in memory, ready for immediate use without the initial delay.

GPT models usually stream the answer as it is generated in real time, which helps the AI system feel responsive by matching the speed at which an average person reads silently: typically 200 to 300 words per minute, or about 3 to 5 words per second. Thanks to the parallel processing capabilities of GPUs, such a system can deliver outputs even faster. However, it's important to note that models with higher parameter counts might require more time to process the same input, potentially affecting the response speed.

Step 10

Focus on the various options below the answer.

Answer

Since this is a simulation, you will only be able to choose the regenerate option.

The options as they appear from left to right are as follows:

  • Edit (pencil icon) allows you to change the response and copy the final result.

  • Copy (clipboard icon) allows you to copy the response as is.

  • Read Aloud (speaker icon) uses the text-to-voice capabilities of the app and reads the text to you, provided you have sound properly configured on your computer.

  • Generation info (the information icon) shows performance metrics of the response, such as response tokens per second, prompt tokens per second or the total time it took for the entire query. You can hover your mouse over the icon to see the results for your query. Note, token count is not the same as word count. Word count can be estimated as 3/4 of the token count. For example, 100 tokens per second is roughly 75 words per second.

  • Thumbs up and down icons let you evaluate the answer as good or bad to help you keep track of the chat history and qualitatively assess your interactions.

  • Continue Response (the play icon) instructs the model to continue with the generation in case the answer streaming pauses, which might happen when the answers become really long.

  • Regenerate (recycle icon) lets you process the very same prompt again. This is helpful when you want to take advantage of the non-deterministic nature of GPTs and get a slightly different answer to the same query without pasting it again.

Step 11

Click the regenerate icon.

Answer

It is the last option on the right.

You will see a different answer compared to the one before.

Notice the < 2/2 > indicator that now appears to the left of the pencil icon. It lets you know that you are currently viewing the second of two generated answers to the same prompt. In a real environment, you can click the < symbol to go back to the previous answer, whereas clicking the > symbol moves to the next one.

Step 12

Notice how this chat is saved in the main navigation panel on the left.

Answer

The chat names are dynamically created. In this case, your chat is filed under the name Laughter is Contagious.

Keep in mind that in a real environment, these chat classifications are dynamically generated and the exact same chat might get filed under a different classification.

Step 13

Click the ellipsis in the chat entry in the main navigation panel.

Answer

You should see several options pertaining to this chat.

In a real environment, you could select any one of these options, but in this simulation you will only be able to choose the Delete option.

Step 14

Click Delete in the menu to delete this chat and clean up the workspace.

Answer

You should see a pop-up window asking for confirmation.

Step 15

Click Confirm.

Answer

You should now see a clean workspace without the previous chat.

Now that you understand the basics of the Open WebUI, you decide to configure it for RAG.

Configure RAG in Open WebUI

From your high-level overview of how RAG works, you know that a couple of things need to be configured in the application. First, you have to specify which embedding model you want the application to use and choose the inference model that generates the final answer. Next, you should set the chunking parameters. You remember that there was an embedding model called mxbai-embed-large readily available in the environment and that you have already tested the llama3.1 inference model. Finally, you will set the chunking method and put it all together within a chat interface to start testing with some real prompts involving general networking know-how. You delve deeper into the Open WebUI documentation and start configuring.

Step 16

Click the user icon in the bottom-left corner of the main navigation panel.

Answer

You should see a pop-up menu containing various settings.

The Settings option lets you configure general UI settings, such as interface layout, personalization, and so on. The Archived Chats option lets you access the saved chat history for the currently logged in user. Playground offers an easy-to-use chat interface where you can define and test various System Instructions, which are detailed guidelines for how you want your chatbot to behave and respond. The Admin Panel lets you manage GPT models (on-prem or cloud provided), and RAG settings, among others.

Step 17

Click Admin Panel in the menu.

Answer

You should see the Users tab selected with the Overview view opened.

Notice how the currently selected Users tab is colored black, whereas the Evaluations, Functions, and Settings tabs are gray.

The Evaluations tab allows you to view how users of this Open WebUI instance have rated the models that they interacted with, based on the thumbs-up and thumbs-down scores for each chat. This feature is currently in beta in version 0.4.7 of the application.

The Functions tab enables you to discover, download, and explore additional custom tools for Open WebUI, including advanced web search features and custom Python code that can be integrated with GPT models.

In the Settings tab, you can find GPT model, RAG, and other advanced settings.

Step 18

Click the Settings tab.

Answer

You should see the Settings page where the General tab is currently selected.

There are a lot of settings for a complex and versatile application such as Open WebUI. You will focus only on the settings that are important for the RAG application and leave the rest at their default values. The settings you are interested in can be found under the Connections, Models, and Documents tabs. You will explore and configure these settings in the following steps.

Step 19

Click Connections in the navigation panel.

Answer

You should see the OpenAI API and Ollama API sections.

Step 20

Focus on the OpenAI API section.

Answer

Here you can manage access to the on-prem or cloud-hosted inference servers that communicate over the OpenAI API.

You can enable or disable the OpenAI API connections whenever you need to, and you can add URLs that lead to cloud-hosted (default) or on-prem servers by clicking the + icon.

If you want to use the cloud-hosted OpenAI inference server, also known as ChatGPT, you have to be a registered user with OpenAI and own an API key. You can add the API key in the corresponding text field next to the URL that will serve as the default API key. The cog icon in the same row lets you define which models from their portfolio you want to use or even use a different API key per model.

You will not be using cloud-hosted services, and you don't have access to any on-prem OpenAI inference servers.

Step 21

Click the toggle button to disable the OpenAI API.

Answer

You should see both the OpenAI API and Ollama API in a disabled state.

Step 22

Click the toggle button in the Ollama API pane to enable the Ollama API.

Answer

You should see the Ollama API connection enabled and set to the default settings.

Here you can manage the connections to the Ollama inference server instances. This simulation is based on a locally hosted Ollama inference server.

The default URL is set to localhost on port 11434. In a real environment, you could connect Open WebUI to another Ollama inference server by configuring the URL and port of the other instance.

The cog icon lets you configure additional settings, such as adding a prefix to the existing connections to avoid conflicts. Because there is only one local connection, you can ignore this setting.

The wrench icon allows you to manage the embedding and inference models hosted by a specific Ollama inference server. Even though the models used in this lab guide were already downloaded, you will explore these settings in the following step to learn how to find and add other models to the system.

Step 23

Click the wrench icon for the Ollama connection.

Answer

You should see the Manage Ollama pop-up window.

At the top, you should see the Pull a model from Ollama.com section. In a real environment, you can choose to download any of the models hosted on the Ollama model repository by typing the model's name and accompanying tag. You will explore how to get this information in the next step.

You can delete any of the downloaded models in the Delete a model section by choosing it from the drop-down menu and clicking the trash can icon. The next figure shows this example.

For more advanced use, you can also augment some of the run time model parameters of the supported models in the Create a model section. You will be using the default settings in this lab, so you can ignore this part.

Expanding the Experimental section reveals the support for downloading models in the GGUF format directly from the Hugging Face platform. This feature was still under development at the time of writing this tutorial and exploring its functionality is not included in this simulation.

Note

Different model formats address various needs, such as SafeTensors for enhanced security and GGUF for optimized quantization, enabling efficient performance and broader compatibility. The Ollama inference server supports both formats.

Step 24

Click the click_here link in the Pull a model from Ollama.com section to access the Ollama model library.

Answer

In a real environment, you would see the Ollama model library open in another tab. In this simulation, you are presented with the full screen view of the tab.

Each model has its name written in bold, a description, and the number of parameters. The library is frequently updated, and you might see different models than the ones shown in the figure if you open the page outside this simulation. Looking at the figure, you see that llama3.3 has 70 billion parameters and, according to the description, offers similar performance to the much larger llama3.1 variant that uses 405 billion parameters.

Additional model details such as quantization are available by clicking the model name in the library. You will explore the llama3.1 variants, because this is the model currently configured in Open WebUI.

Step 25

Press the down-arrow key on your keyboard.

Answer

You should see llama3.1 in the middle, among other models.

Note

In a real environment, you can use the scroll-wheel or the scroll-bar to navigate the page.

Step 26

Click llama3.1 from the list to see additional variants you could use.

Answer

You should see the default model variant for llama3.1.

The table shows the last time that this variant was updated. At the time of writing this tutorial, this model was two weeks old.

The model row tells you the GPT architecture, number of parameters, and level of quantization. In this case, the architecture is llama, the model has 8.03 billion parameters, and it uses 4-bit quantization, denoted by the Q4 prefix of the Q4_K_M string. The K_M suffix refers to the quantization method and optimization.

The params and template fields are used to control the behavior and formatting of the model's responses. For example, the stop sequence in the params field defines when the model should stop generating text. The template section defines how the input prompts are formatted before being passed to the model, which is already taken care of within Open WebUI.

Step 27

Click 8b to reveal a drop-down menu with other model variants.

Answer

You should see the 8b, 70b and 405b parameter variants.

The drop-down shows three variants, which does not mean that there are only three available variants. The three variants shown in the list are the Q4_K_M quantization variants. The View all menu option reveals all variants.

Step 28

Click View all in the menu.

Answer

You should see a list of all variants that are offered by the llama3.1 family.

If you want to download any of these specific models, be sure to use the full name written in bold. Look at the figure of model variants. To use the llama3.1 405 billion Instruct model with quantization of 16 bits, you would have to use 405b-instruct-fp16 as a tag. So, a full model name in this case would be llama3.1:405b-instruct-fp16.

To download this model, you would write llama3.1:405b-instruct-fp16 in the Pull a model from Ollama.com field and click the download button, as shown in the next example.

Step 29

Press the Enter key to return to the Manage Ollama page.

Answer

Step 30

Click the X icon in the top-right corner of the page to close the Manage Ollama page.

Answer

You will return to the Settings view.

Step 31

Click Models in the navigation panel.

Answer

You should see a list of all the currently downloaded models.

Here you could use the toggle buttons to effectively hide the models from other non-administrative users. Clicking the pencil icons lets you add detailed descriptions, default prompts, and other metadata that all the users see. Leave all these settings as is because you will configure your own chatbot using one of these models as a baseline after you are done configuring everything for RAG.

Step 32

Click Documents in the navigation panel.

Answer

You should see all the settings for document processing and RAG.

Step 33

Focus on the General Settings pane.

Answer

You can download various embedding models and reference them here to use with RAG. The Embedding Model Engine is currently set to SentenceTransformers. Sentence Transformers is a model repository, similar to Ollama's model library and the Hugging Face portal. You will use the Ollama server as the embedding model engine.

Step 34

Click Default (SentenceTransformers).

Answer

You should see a drop-down menu with the Ollama and OpenAI options.

The OpenAI engine points to the cloud-hosted embedding model provided by OpenAI. As with the OpenAI inference models, you must have a registered account at OpenAI and an API key. The same pricing policy applies in this case as well.

Step 35

Click Ollama in the menu.

Answer

You should see the http://localhost:11434 URL, which points to the Ollama inference server.

Below the Ollama URL, you should find the Embedding Batch Size parameter, which is currently set to 1. The embedding batch size determines how many text chunks are fed into the model at the same time during the embedding generation process. Larger batch sizes allow the system to process multiple text chunks in parallel, increasing throughput and reducing the total time required to generate embeddings for large datasets. Each batch of text chunks consumes memory that is proportional to the size of the text, the batch size, and the complexity of the model. Smaller batch sizes are more memory-efficient and suitable for devices with limited resources.
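The batching itself is straightforward to sketch; the trade-off is fewer model calls per document versus more memory per call. Batch size 1, the current setting, degenerates to one chunk per call:

```python
def batches(chunks, batch_size=1):
    # Yield successive groups of chunks to feed the embedding model together.
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]

chunks = [f"chunk-{n}" for n in range(10)]
print(len(list(batches(chunks, batch_size=1))))  # 10 calls to the model
print(len(list(batches(chunks, batch_size=4))))  # 3 calls (4 + 4 + 2 chunks)
```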

Next, you will see the Hybrid Search option, which is currently set to off. Turning this option on enables the use of re-ranking. Re-ranking takes the initial list of retrieved search results, ranked by relevance score, and refines their order using a more advanced, specialized model called a re-ranking model. While the initial search focuses on quickly finding documents that match the query, re-ranking dives deeper into the context and meaning of both the query and the retrieved results. This ensures the most relevant and high-quality matches are prioritized at the top of the list. Re-ranking is particularly useful when you need highly precise results in scenarios where relevance can significantly impact user experience, such as question answering or document retrieval in a critical domain. However, re-ranking requires an additional specialized LLM that is trained specifically for this task, which adds computational overhead. You will not use re-ranking in this simulation.

Further down in the interface, you will see the Embedding Model pane.

Here you define the embedding model that you want to use. You can download models in the Connection settings, as described in the earlier steps. The mxbai-embed-large model was already downloaded for the simulation. You will set this model for your RAG application. Many different embedding models are available, and they differ in how fast and in what manner they convert text into vectors. Note that changing the embedding model requires you to re-process all the text, because different models create different vectors from the same text. The mxbai-embed-large model offers a good compromise between the quality of the embeddings (text-to-vector translation) and processing speed.

Step 36

Type mxbai-embed-large:latest in the Embedding Model field and press the Enter key.

Answer

Your screen should now look as follows:

Step 37

Take a look at the Content Extraction settings.

Answer

The Content Extraction settings define the engine that you will use to extract the text from the documents and parse it. The default setting uses an internal mechanism for this process. In a real environment, you can click Default, which opens a drop-down menu where you can select the Apache Tika server. Apache Tika is an advanced parsing engine that can detect and extract metadata and text from over a thousand different file types such as PPT, XLS, and PDF.

Step 38

Press the down-arrow key on your keyboard.

Answer

You should see the Query Params section.

Notice that the Top K parameter is set to 5. This parameter specifies how many of the highest-ranked database entries are selected from the retrieval phase based on their relevance scores. For example, if Top K is set to 5, the system retrieves the 5 most relevant entries for a given query from the database to be used as context. The context is then passed on for further processing such as re-ranking or directly feeding into the generative model. If Top K is too low, there’s a risk of missing relevant information, leading to incomplete or inaccurate answers. On the other hand, if Top K is too high, the system may process irrelevant content, which could reduce the quality of the response.
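The selection step can be sketched as follows; the relevance scores and chunk names are hypothetical.

```python
# Sketch of the Top K selection step: given (score, entry) pairs from
# the vector search, keep only the K highest-scoring entries as context.

def select_top_k(scored_entries, k=5):
    ranked = sorted(scored_entries, key=lambda pair: pair[0], reverse=True)
    return [entry for score, entry in ranked[:k]]

results = [
    (0.91, "chunk A"), (0.42, "chunk B"), (0.77, "chunk C"),
    (0.88, "chunk D"), (0.15, "chunk E"), (0.80, "chunk F"),
]
context = select_top_k(results, k=5)  # drops the lowest-scoring chunk E
```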

Below the Top K parameter, you can see the default RAG Template. The RAG Template acts as a system prompt specifically designed for queries that require retrieving information from the knowledge base (the vector database). Without this template, the GPT model might generate irrelevant responses, such as commenting that it found information in the knowledge base instead of answering the question. The template also includes explicit instructions for the model to acknowledge when it does not have sufficient information to answer a query. This guidance is crucial in mitigating hallucinations and minimizing inaccurate or erroneous responses. By clearly defining the model's behavior and providing a structured framework, the RAG template ensures more accurate and reliable outputs while maintaining alignment with the retrieved knowledge base. The structure of the system prompt is carefully crafted using XML-style tags, which distinguish instructions from placeholders. The placeholders are used to dynamically insert the retrieved context and the original query. Note that you can use any markup language or templating syntax to define the placeholders for the context and the query. Do not change the template, because doing so can significantly impact the responses of the RAG system.
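The mechanics can be sketched as below: XML-style tags separate instructions from data, and placeholders are filled in at query time. The template text here is a simplified stand-in written for illustration, not Open WebUI's exact default template.

```python
# Simplified stand-in for a RAG system prompt template. The wording,
# tags, and placeholder names are illustrative, not Open WebUI's default.

TEMPLATE = """Use the following context to answer the query.
If the context does not contain the answer, say you do not know.

<context>
{context}
</context>

<query>
{query}
</query>"""

def build_prompt(retrieved_chunks, query):
    # The retrieved chunks and the user's query are inserted dynamically.
    return TEMPLATE.format(context="\n---\n".join(retrieved_chunks), query=query)

prompt = build_prompt(
    ["Static MAC addresses are not supported on tunnel interfaces."],
    "Can I configure static MAC addresses on tunnel interfaces?",
)
```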

Note

In a real environment, you can use the scroll-wheel to navigate the page.

Step 39

Press the down-arrow key again.

Answer

You should see the Chunk Params pane followed by the rest of the Documents settings.

The Chunk Params section is used to configure the Chunk Size and Chunk Overlap parameters, which play a critical role in splitting text into smaller, manageable pieces for representation in a vector database. These parameters directly influence how text is segmented and stored as vectors, affecting both the accuracy of search results and the performance of the retrieval process.

The Chunk Size parameter determines the number of characters or tokens included in each chunk of text. A larger chunk size allows each vector to represent a longer span of text, which provides more context and can be particularly useful when the goal is to retain broader ideas or generate summaries. However, larger chunks can make the search results less precise because they cover more content within a single vector. Conversely, a smaller chunk size divides the text into more granular pieces, which allows for finer search accuracy and improves the system’s ability to retrieve very specific details. For instance, a chunk size of 200 characters works well for highly detailed queries, whereas a chunk size of 1000 characters is better suited for broader queries or long-form content where retaining larger contexts is critical. You can leave the Chunk Size setting at 500 because it offers good granularity for the prompts that you will use during the simulation.

The Chunk Overlap parameter determines how much of the text in one chunk is repeated in the next. This overlap helps preserve continuity between the chunks, ensuring that important information spanning across boundaries is not lost. Higher overlap values are particularly useful when dealing with text that contains flowing ideas because they maintain context and coherence. However, they also introduce redundancy and increase the overall storage size. Lower overlap values, on the other hand, reduce this redundancy but may risk cutting off meaningful information at chunk boundaries. For example, if the chunk size is set to 500 characters with an overlap of 50 characters, the first chunk includes characters 1–500, and the second chunk begins at character 451, extending to 950. You can also leave this setting at 100 because this value proved to be sufficient considering the Chunk Size and the files you will be working with.
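The splitting described above can be sketched as a short function: each new chunk starts (chunk size minus overlap) characters after the previous one. This is a simplified character-based sketch, not Open WebUI's internal splitter.

```python
# Character-based chunking with overlap, as described above: each new
# chunk starts (chunk_size - overlap) characters after the previous one,
# and chunking stops once a chunk reaches the end of the text.

def chunk_text(text, chunk_size=500, overlap=100):
    step = chunk_size - overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):  # this chunk already reached the end
            break
        i += step
    return chunks

# Reproduce the example from the text: 950 characters, size 500, overlap 50.
text = "".join(str(i % 10) for i in range(950))
chunks = chunk_text(text, chunk_size=500, overlap=50)
# chunks[0] covers characters 1-500, chunks[1] covers characters 451-950,
# so the last 50 characters of the first chunk repeat in the second.
```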

Step 40

Click Save in the bottom-right corner of the page to apply the settings and complete the RAG configuration.

Answer

You should see green notifications informing you of successful configuration.

Step 41

Press the Enter key to remove the green notifications.

Answer

In a real environment, green notifications disappear on their own. Your screen should now look as it did before you clicked Save.

With the RAG configurations in place, you turn to the documentation regarding using your own files with RAG.

Write Basic Prompts with RAG

Now that you have configured the parameters used in RAG, you decide to upload the files you want to work with. Your current project revolves around Cisco Nexus 9000 switches, and you want to simplify your task by uploading some configuration guides and using RAG to help you configure the switches. You would also like to upload outputs from show commands, such as show running-config or show cdp neighbors, and other more specific outputs. You hope that providing this information will be enough for the RAG system to get a good overview of what is already configured and to help you with some additional configurations.

You are a bit worried about the impact of the different files on the quality of the answers, because configuration syntax and the output of various show commands are drastically different from the language used in the configuration guides that you intend to use. In addition, you do not know whether llama3.1 was trained to recognize and work with Cisco configuration syntax. Also, knowing how chunking works, you have second thoughts about the Chunk Size and Chunk Overlap values that you configured and are afraid that the text stored in the vector database entries might not capture all the relevant context. Specifically, you are worried about splitting the outputs of show commands in the same way as the configuration guides. You decide to test the current settings using the files as they are and simply see whether they do the job. First, you will upload all the files and use Open WebUI's built-in pipeline, which processes the files and stores them in vector format in the vector database. Next, you will test how well llama3.1 and the RAG settings retrieve data from the vector database.

Step 42

Click Workspace in the main navigation pane.

Answer

You should see the Workspace page.

The Workspace page is used to manage so-called Models, or ChatBots. You can define a ChatBot to use a specific inference model, add instructions on how you want the bot to behave, and even add a database for the model to use. For now, you will upload the output from various show commands and make them available for all ChatBots.

Step 43

Click Knowledge at the top.

Answer

You should see the existing knowledge collections that were pre-loaded for this lab simulation.

A knowledge collection comprises one or more uploaded files that are stored in a vector database. The collection serves as a single reference to multiple files so you do not have to specify to GPT which file to check for data on each prompt.

The first collection on the left, titled Nexus 9K Complete Configuration Guide, includes the entire PDF of the Nexus 9300 Interface Configuration Guide downloaded from: https://www.cisco.com/c/en/us/td/docs/dcn/nx-os/nexus9000/105x/configuration/interfaces/cisco-nexus-9000-series-nx-os-interfaces-configuration-guide-release-105x/m_overview_9x.html

The second collection, named Nexus 9K L3 Interfaces Guide, includes only a PDF of a chapter for Layer 3 interfaces from the entire configuration guide.

These two collections will be used to show how much influence the splitting of the initial dataset has on the quality of a RAG system.

Step 44

Click the + icon to create a new collection.

Answer

You should see the Create a knowledge base page.

Step 45

Name the knowledge base Fabric Information and add the following description: Output of various show commands.

Answer

The page should look similar to the following. Note that the page in the lab simulation might look a bit different due to the text input objects in the simulation.

Step 46

Double-check that the entered information is the same as in the instructions and click Create Knowledge.

Answer

You should see a green banner informing you of the successful creation of the knowledge base.

Note

If the simulation does not continue, check the name and description again, since they must be entered verbatim as specified in the instructions for the simulation to proceed.

Step 47

Press the Enter key to continue.

Answer

In a real environment, the green banners disappear after a couple of seconds on their own. You should now see only the Fabric Information collection page.

Here you can manage, add, and delete all the files associated with this collection.

Step 48

Click the + icon to begin the upload and embedding process.

Answer

You should see a drop-down menu.

The Upload files option lets you upload one or more files at once. The Upload directory option lets you specify a directory from which all files will be uploaded. The Sync directory option deletes all existing files and replaces them with the new ones. The Add text content option lets you add text that you write yourself.

Step 49

Click Upload files.

Answer

You should see a file browser window open with all the files you intended to use for RAG.

Notice the two types of files: PDF and TXT. The two PDF files are the configuration guides mentioned in the previous steps and are already part of their respective collections. You are interested in the TXT files, which contain information about the deployed Cisco Nexus 9300 switches.

The TXT files contain the output from the various show commands describing the fabric that you are currently working on for the company. The fabric contains one spine and two leaf Cisco Nexus 9300 switches in a CLOS topology. The fabric is configured as BGP with EVPN overlay and also includes all the necessary QoS configurations for AI/ML workloads, which is quite a complex configuration. The TXT files are named by the switch (leaf01, leaf02, spine01) and the command to obtain the output. Each file also has this information written in a header.

This kind of data preparation is necessary for RAG because it relies on semantic similarity between the prompt and the data contained in the files. In other words, you cannot expect a RAG system to provide data regarding leaf02 if leaf02 is not mentioned in the text itself. Keep this in mind, because you will put it to the test in the following steps.
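As a toy illustration of why the per-file headers matter, the sketch below uses naive keyword overlap as a stand-in for embedding similarity: a chunk that never names the device cannot be matched to a query about that device. The chunk contents are invented for the example.

```python
# Toy illustration: retrieval can only surface chunks whose text actually
# relates to the query. Keyword overlap stands in for embedding similarity,
# and the chunk contents are made up.

def score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

query = "cdp neighbors of leaf02"

# Raw command output without any header naming the device:
without_header = "Device ID  Local Intrfce  Capability: spine01 Eth1/1"
# The same output with a header that names the switch and the command:
with_header = "leaf02 show cdp neighbors output: spine01 Eth1/1"

# Only the chunk that mentions leaf02 can be matched to the leaf02 query.
```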

Step 50

Press the Enter key to continue the simulation.

Answer

Notice that only the text files (.txt suffix) are selected.

In a real environment, you would have to select the files yourself. However, placing these files in a directory and using the Upload directory option would make this less tedious.

Step 51

Click Open to upload the files.

Answer

You should see all the files processed and their names written on the page.

In a real environment, you see how each file is uploaded and converted into vectors. Each time the system is done with one file, it prints a success message, outputs the filename, and moves to the next one, as you can see in the figure.

Note

The file browser that is used to show how to upload the file belongs to a Microsoft Windows OS. The process is the same for a different OS, but the button to trigger the upload once the files are selected is different for each OS.

Step 52

Click Knowledge to see the collections again.

Answer

You should now see three collections.

You can see when a collection was last changed in the collection card. Notice that it states that the Fabric Information collection was updated a few seconds ago. This update was triggered by the uploading of the TXT files.

The PDF files were uploaded using the same procedure. Although you practiced uploading files with the output of show commands, you will begin by exploring how the RAG system works with the PDF documents, because PDFs are easier to understand and provide a more straightforward approach for learning how RAG handles citations.

Step 53

Click New Chat in the main navigation panel.

Answer

You should see a brand-new chat interface window with the default Arena Model selected.

You will change the model to llama3.1 and set it as default in the following steps because you will be experimenting and frequently opening new chats.

Step 54

Click Arena Model at the top.

Step 55

Click llama3.1.

Step 56

Click Set as default to make llama3.1 the default model for every new chat.

Step 57

Press the Enter key to continue the simulation.

Step 58

Click New Chat again.

Step 59

Click in the chat field and type a # symbol. Then press the Enter key to continue.

Step 60

Choose the Nexus 9K Complete Configuration Guide.

Step 61

Press the down-arrow key on your keyboard to insert the following prompt: Can I configure static MAC addresses on tunnel interfaces?

Step 62

Click the arrow icon or press the Enter key to send the prompt.

Step 63

Press the down-arrow key to get to the end of the response.

Step 64

Click the citation link under the answer.

Step 65

Press the down-arrow key to see some other citations.

Step 66

Click the X icon in the top-right corner of the Citation page.

Step 67

Click New Chat in the main navigation panel.

Step 68

Enter the # symbol in the chat interface and press the Enter key to continue.

Step 69

Click the collection titled Nexus 9K L3 Interfaces Guide.

Step 70

Press the down-arrow key to enter the following prompt: Can I configure static MAC addresses on tunnel interfaces?

Step 71

Click the up-arrow icon or press the Enter key to send the prompt.

Step 72

Click the citations link.

Explore Prompt Engineering

Your RAG setup is operational, and you have already learned the common pitfalls and challenges of using RAG, particularly concerning the data used for the database. Now, you want to test how different prompts influence the quality of the answers.

There are many prompt engineering techniques, and you may have used some of them without even realizing it. One of the most basic techniques is called zero-shot prompting, where you simply ask a question without providing any additional instruction, data, or context. This type of prompt is likely to yield poor answers.

A more advanced technique is few-shot prompting, where you include additional context, examples, or specific instructions within the prompt itself. This approach helps GPT provide more accurate and relevant answers by guiding its reasoning process.
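As a sketch of the difference, the snippet below builds a zero-shot and a few-shot prompt for the same question. The instruction line and the example question/answer pairs are made up for illustration.

```python
# Contrast between a zero-shot and a few-shot prompt. The instruction
# and the example Q/A pairs are invented for illustration.

question = "What command shows the CDP neighbors on a Cisco Nexus switch?"

# Zero-shot: just the question, no instructions, examples, or context.
zero_shot = question

# Few-shot: an instruction plus worked examples guide the model toward
# the desired answer format before the real question is asked.
few_shot = """Answer with only the exact CLI command.

Q: What command shows the running configuration?
A: show running-config
Q: What command shows the interface status?
A: show interface brief
Q: {q}
A:""".format(q=question)
```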

Regardless of the prompt technique used, you can control how creative a GPT model is when generating answers by adjusting a parameter called temperature. Higher temperature values encourage more imaginative and diverse responses but can decrease accuracy and increase the likelihood of factual errors.
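Under the hood, temperature rescales the model's next-token scores (logits) before sampling: probabilities are computed as softmax(logit / T). The sketch below uses made-up logits to show how a higher temperature flattens the distribution while a lower one sharpens it.

```python
import math

# Temperature scaling of next-token logits: probabilities come from
# softmax(logit / T). Higher T flattens the distribution (more varied,
# less predictable output); lower T sharpens it. Logits are made up.

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                      # hypothetical token scores
cold = softmax_with_temperature(logits, 0.5)  # sharply favors the top token
hot = softmax_with_temperature(logits, 2.0)   # spreads probability out
```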

You plan to try both approaches and experiment with how your system responds to different values of the temperature parameter.

Step 73

Press the Enter key to reset the chat interface.

Step 74

Press the down-arrow key to insert the following prompt: Can you configure VLAN 5010 on an interface of a Cisco Nexus 9000 series switch?

Step 75

Click the up-arrow icon or press the Enter key to process the prompt.

Step 76

Press the down-arrow key.

Step 77

Press the down-arrow key again.

Step 78

Click New Chat in the main navigation panel.

Step 79

Press the down-arrow key to insert the following prompt: Is it allowed to configure VLAN 5010 on an interface of a Cisco Nexus 9000 series switch?

Step 80

Click the up-arrow icon or press the Enter key to process the prompt.

Step 81

Press the Enter key.

Step 82

Press the down-arrow key.

Step 83

Click the up-arrow icon or press the Enter key to process the prompt.

Step 84

Click the chat settings icon.

Step 85

Click temperature in the side-panel.

Step 86

Click the regenerate button.

Step 87

Click the temperature slider.

Step 88

Click the regenerate icon again.

Step 89

Click the regenerate icon one more time.

Explore Prompt Engineering with RAG

Now you will see how prompt engineering works with RAG. You will use the Fabric Information collection as a knowledge base for the system. This collection contains the output of various show commands. You already uploaded the files in one of the previous tasks of this lab.

The fabric you are working with is a CLOS topology with one spine and two leaf switches, named spine01, leaf01, and leaf02. The fabric is configured as a BGP fabric with EVPN overlay. Other information will be disclosed during the exercise for verification purposes.

Step 90

Press the Enter key to reset the interface.

Step 91

Click inside the prompt field, write #Fa, and press the Enter key to continue.

Step 92

Click Fabric Information in the drop-down menu.

Step 93

Press the down-arrow key.

Step 94

Click the up-arrow icon or press the Enter key to process the prompt.

Step 95

Click the spine01_running_config.txt citation.

Step 96

Click the X icon in the upper-right corner to close the citation.

Step 97

Press the down-arrow key to add a prompt below the current response.

Step 98

Click the up-arrow icon or press the Enter key to process the prompt.

Step 99

Click the first citation: leaf02_management_vrf_data.txt

Step 100

Click the X icon in the top-right corner of the citation to close this citation.

Step 101

Click leaf01_management_vrf_data.txt citation.

Step 102

Click the X icon in the top-right corner of the citation to close the citation.

Step 103

Click the and 3 more text next to the caret.

Step 104

Click the scrollbar above the prompt field.

Step 105

Click the leaf02_cdp_neighbors.txt citation.

Step 106

This concludes the simulation.

Keep going!